Udemy is one of the most popular e-learning platforms in the world. According to its website, the platform has over 75,000 instructors,
150,000 courses, 250 million enrollments and 33 million minutes of content.
The Udemy dataset contains information about courses available on Udemy that were published between 2011 and 2017.
The dataset is available for free on Kaggle: https://www.kaggle.com/andrewmvd/udemy-courses
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import neattext.functions as nfx
data = pd.read_csv('datasets/udemy_courses.csv')
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3678 entries, 0 to 3677
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   course_id            3678 non-null   int64
 1   course_title         3678 non-null   object
 2   url                  3678 non-null   object
 3   is_paid              3678 non-null   bool
 4   price                3678 non-null   int64
 5   num_subscribers      3678 non-null   int64
 6   num_reviews          3678 non-null   int64
 7   num_lectures         3678 non-null   int64
 8   level                3678 non-null   object
 9   content_duration     3678 non-null   float64
 10  published_timestamp  3678 non-null   object
 11  subject              3678 non-null   object
dtypes: bool(1), float64(1), int64(5), object(5)
memory usage: 319.8+ KB
data.head()
| | course_id | course_title | url | is_paid | price | num_subscribers | num_reviews | num_lectures | level | content_duration | published_timestamp | subject |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1070968 | Ultimate Investment Banking Course | https://www.udemy.com/ultimate-investment-bank... | True | 200 | 2147 | 23 | 51 | All Levels | 1.5 | 2017-01-18T20:58:58Z | Business Finance |
| 1 | 1113822 | Complete GST Course & Certification - Grow You... | https://www.udemy.com/goods-and-services-tax/ | True | 75 | 2792 | 923 | 274 | All Levels | 39.0 | 2017-03-09T16:34:20Z | Business Finance |
| 2 | 1006314 | Financial Modeling for Business Analysts and C... | https://www.udemy.com/financial-modeling-for-b... | True | 45 | 2174 | 74 | 51 | Intermediate Level | 2.5 | 2016-12-19T19:26:30Z | Business Finance |
| 3 | 1210588 | Beginner to Pro - Financial Analysis in Excel ... | https://www.udemy.com/complete-excel-finance-c... | True | 95 | 2451 | 11 | 36 | All Levels | 3.0 | 2017-05-30T20:07:24Z | Business Finance |
| 4 | 1011058 | How To Maximize Your Profits Trading Options | https://www.udemy.com/how-to-maximize-your-pro... | True | 200 | 1276 | 45 | 26 | Intermediate Level | 2.0 | 2016-12-13T14:57:18Z | Business Finance |
data.level.value_counts()
All Levels            1929
Beginner Level        1270
Intermediate Level     421
Expert Level            58
Name: level, dtype: int64
data.isnull().sum()
course_id              0
course_title           0
url                    0
is_paid                0
price                  0
num_subscribers        0
num_reviews            0
num_lectures           0
level                  0
content_duration       0
published_timestamp    0
subject                0
dtype: int64
data.published_timestamp = pd.to_datetime(data.published_timestamp).dt.date.astype('datetime64[ns]')
data.published_timestamp
0 2017-01-18
1 2017-03-09
2 2016-12-19
3 2017-05-30
4 2016-12-13
...
3673 2016-06-14
3674 2017-03-10
3675 2015-12-30
3676 2016-08-11
3677 2014-09-28
Name: published_timestamp, Length: 3678, dtype: datetime64[ns]
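The conversion above drops the time of day by round-tripping through Python `date` objects. A minimal sketch of a vectorised alternative follows, assuming the timestamps are ISO 8601 strings in UTC like those shown in the `data.head()` output. The two-row series `raw` is a stand-in for `data.published_timestamp`.

```python
import pandas as pd

# Two timestamps copied from the data.head() output, standing in for the full column.
raw = pd.Series(["2017-01-18T20:58:58Z", "2016-12-19T19:26:30Z"])

# Parse as UTC, drop the timezone information, then truncate each value to midnight.
dates = pd.to_datetime(raw, utc=True).dt.tz_localize(None).dt.normalize()

print(dates.dt.strftime("%Y-%m-%d").tolist())
```

The result is a timezone-naive datetime column, so the `.dt.year` and `.dt.month` accessors used later keep working.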
data.subject.unique()
array(['Business Finance', 'Graphic Design', 'Musical Instruments',
'Web Development'], dtype=object)
subject = data.subject.value_counts().rename_axis('name').reset_index(name='count').sort_values(by='count')
subject
| | name | count |
|---|---|---|
| 3 | Graphic Design | 603 |
| 2 | Musical Instruments | 680 |
| 1 | Business Finance | 1195 |
| 0 | Web Development | 1200 |
fig = px.bar(subject,
x='name',
y='count',
color='name',
color_discrete_sequence=px.colors.sequential.Blues_r,
opacity=0.8,
title='Value Count of Subject Types'
)
fig.update_traces(marker_line_color='black',
marker_line_width=1)
fig.update_layout(showlegend=False, width=650)
fig.show()
fig = px.pie(subject,
values='count',
names='name',
hole=0.5,
color_discrete_sequence=px.colors.sequential.Blues_r,
title='Subject Types [%]')
fig.update_layout(width=600)
fig.show()
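The donut chart shows each subject's share of all courses. As a small sketch, the same percentages can be computed directly from the counts in the `subject` table. The dictionary below copies those four counts by hand rather than reading the dataset.

```python
# Course counts per subject, copied from the subject table above.
counts = {
    "Graphic Design": 603,
    "Musical Instruments": 680,
    "Business Finance": 1195,
    "Web Development": 1200,
}

total = sum(counts.values())

# Share of each subject as a percentage of all courses, rounded to one decimal.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

for name, pct in shares.items():
    print(f"{name}: {pct}%")
```

With the full DataFrame loaded, `data.subject.value_counts(normalize=True)` should give the same fractions without the manual step.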
y = data.published_timestamp.dt.year
m = data.published_timestamp.dt.month
subject_year = data[['published_timestamp']].sort_values(by=['published_timestamp'])
subject_year['count_m'] = subject_year.groupby([y, m])['published_timestamp'].transform('size')
subject_year.count_m = subject_year.count_m.cumsum()
subject_year['type'] = 'total'
for val in data.subject.unique():
    temp = data[data.subject == val]
    val_df = temp[['published_timestamp']].sort_values(by=['published_timestamp'])
    val_df['count_m'] = val_df.groupby([y, m])['published_timestamp'].transform('size')
    val_df.count_m = val_df.count_m.cumsum()
    val_df['type'] = val
    subject_year = pd.concat([subject_year, val_df], ignore_index=True)
subject_year
| | published_timestamp | count_m | type |
|---|---|---|---|
| 0 | 2011-07-09 | 1 | total |
| 1 | 2011-09-09 | 2 | total |
| 2 | 2011-11-19 | 4 | total |
| 3 | 2011-11-29 | 6 | total |
| 4 | 2011-12-20 | 7 | total |
| ... | ... | ... | ... |
| 7351 | 2017-06-29 | 37406 | Web Development |
| 7352 | 2017-06-30 | 37447 | Web Development |
| 7353 | 2017-06-30 | 37488 | Web Development |
| 7354 | 2017-07-03 | 37490 | Web Development |
| 7355 | 2017-07-06 | 37492 | Web Development |
7356 rows × 3 columns
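In the table above, `count_m` ends at 37492 for Web Development although that subject has only 1200 courses. The reason is that `transform('size')` repeats each month's size on every row of that month, so the following `cumsum()` adds it once per course. If the intent is a running number of published courses, one possible alternative is to count once per month and then accumulate. This is an assumption about the intent, not the notebook's method. The five dates below are the first rows of the table above.

```python
import pandas as pd

# First five publication dates from the subject_year table above.
published = pd.Series(pd.to_datetime(
    ["2011-07-09", "2011-09-09", "2011-11-19", "2011-11-29", "2011-12-20"]
))

# Count courses once per calendar month, in chronological order.
monthly = published.dt.to_period("M").value_counts().sort_index()

# Running total of published courses at the end of each month.
cumulative = monthly.cumsum()

print(cumulative)
```

Here the last cumulative value equals the number of courses, whereas the original approach yields 1, 2, 4, 6, 7 for the same five rows.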
fig = px.line(subject_year[subject_year.type == 'total'],
x='published_timestamp',
y='count_m',
title='All Subjects Distribution')
fig.update_traces(fill='tozeroy',
line=dict(color="steelblue", width=2))
fig.update_layout(width=600)
fig.show()
fig = px.line(subject_year[subject_year.type != 'total'],
x='published_timestamp',
y='count_m',
color='type',
title='Subject Types Distribution')
fig.update_traces(line=dict(width=2))
fig.update_layout(width=700)
fig.show()
subject_subscribers = data.groupby('subject')['num_subscribers'].sum().to_frame('count').reset_index().rename(columns={'subject': 'name'}).sort_values(by='count')
subject_subscribers
| | name | count |
|---|---|---|
| 2 | Musical Instruments | 846689 |
| 1 | Graphic Design | 1063148 |
| 0 | Business Finance | 1868711 |
| 3 | Web Development | 7980572 |
fig = px.bar(subject_subscribers,
x='name',
y='count',
color='name',
color_discrete_sequence=px.colors.sequential.Blues_r,
opacity=0.8,
title='Subscriber Count vs Subject Type'
)
fig.update_traces(marker_line_color='black',
marker_line_width=1)
fig.update_layout(showlegend=False, width=650)
fig.show()
fig = px.pie(subject_subscribers,
values='count',
names='name',
hole=0.5,
color_discrete_sequence=px.colors.sequential.Blues_r,
title='Subscriber Count vs Subject Type [%]')
fig.update_layout(width=600)
fig.show()
subject_duration = data.groupby(['subject', 'is_paid'])['content_duration'].mean().to_frame('mean_duration').reset_index().rename(columns={'subject': 'subject_name'})
subject_duration
| | subject_name | is_paid | mean_duration |
|---|---|---|---|
| 0 | Business Finance | False | 2.148611 |
| 1 | Business Finance | True | 3.675675 |
| 2 | Graphic Design | False | 1.917619 |
| 3 | Graphic Design | True | 3.683011 |
| 4 | Musical Instruments | False | 1.547101 |
| 5 | Musical Instruments | True | 2.949238 |
| 6 | Web Development | False | 2.562281 |
| 7 | Web Development | True | 5.972790 |
fig = px.bar(subject_duration,
x='subject_name',
y='mean_duration',
color='is_paid',
barmode='group',
color_discrete_sequence=px.colors.sequential.Blues_r,
opacity=0.8,
title='Mean Duration vs Subject Type'
)
fig.update_traces(marker_line_color='black',
marker_line_width=1)
fig.update_layout(width=650)
fig.show()
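The grouped bars compare free and paid courses within each subject. As a rough sketch, the long-format table can be pivoted so that the paid-to-free ratio is explicit. The values below are copied from the `subject_duration` table and rounded to two decimals. The `kind` and `paid_to_free` column names are introduced here for illustration only.

```python
import pandas as pd

# Mean durations copied from the subject_duration table above (rounded).
subject_duration = pd.DataFrame({
    "subject_name": ["Business Finance", "Business Finance",
                     "Graphic Design", "Graphic Design",
                     "Musical Instruments", "Musical Instruments",
                     "Web Development", "Web Development"],
    "is_paid": [False, True] * 4,
    "mean_duration": [2.15, 3.68, 1.92, 3.68, 1.55, 2.95, 2.56, 5.97],
})

# Replace the boolean flag with readable labels before pivoting.
subject_duration["kind"] = subject_duration["is_paid"].map({True: "paid", False: "free"})

# One row per subject, one column each for free and paid courses.
wide = subject_duration.pivot(index="subject_name", columns="kind", values="mean_duration")
wide["paid_to_free"] = wide["paid"] / wide["free"]

print(wide.round(2))
```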
data.level.unique()
array(['All Levels', 'Intermediate Level', 'Beginner Level',
'Expert Level'], dtype=object)
levels = data.level.value_counts().rename_axis('name').reset_index(name='count').sort_values(by='count')
levels
| | name | count |
|---|---|---|
| 3 | Expert Level | 58 |
| 2 | Intermediate Level | 421 |
| 1 | Beginner Level | 1270 |
| 0 | All Levels | 1929 |
fig = px.bar(levels,
x='name',
y='count',
color='name',
color_discrete_sequence=px.colors.sequential.Blues_r,
opacity=0.8,
title='Value Count of Level Types'
)
fig.update_traces(marker_line_color='black',
marker_line_width=1)
fig.update_layout(showlegend=False, width=650)
fig.show()
level_subject = data.groupby(['subject', 'level']).size().to_frame('count').reset_index().rename(columns={'subject': 'subject_name'})
level_subject
| | subject_name | level | count |
|---|---|---|---|
| 0 | Business Finance | All Levels | 696 |
| 1 | Business Finance | Beginner Level | 340 |
| 2 | Business Finance | Expert Level | 31 |
| 3 | Business Finance | Intermediate Level | 128 |
| 4 | Graphic Design | All Levels | 298 |
| 5 | Graphic Design | Beginner Level | 243 |
| 6 | Graphic Design | Expert Level | 5 |
| 7 | Graphic Design | Intermediate Level | 57 |
| 8 | Musical Instruments | All Levels | 276 |
| 9 | Musical Instruments | Beginner Level | 296 |
| 10 | Musical Instruments | Expert Level | 7 |
| 11 | Musical Instruments | Intermediate Level | 101 |
| 12 | Web Development | All Levels | 659 |
| 13 | Web Development | Beginner Level | 391 |
| 14 | Web Development | Expert Level | 15 |
| 15 | Web Development | Intermediate Level | 135 |
fig = px.bar(level_subject.sort_values(by='count', ascending=False),
x='subject_name',
y='count',
color='level',
barmode='group',
color_discrete_sequence=px.colors.sequential.Blues_r,
opacity=0.8,
title='Level vs Subject Type'
)
fig.update_traces(marker_line_color='black',
marker_line_width=1)
fig.update_layout(width=650)
fig.show()
level_subscribers = data.groupby('level')['num_subscribers'].sum().to_frame('count').reset_index().rename(columns={'level': 'name'}).sort_values(by='count')
level_subscribers
| | name | count |
|---|---|---|
| 2 | Expert Level | 50196 |
| 3 | Intermediate Level | 742005 |
| 1 | Beginner Level | 4051843 |
| 0 | All Levels | 6915076 |
fig = px.bar(level_subscribers,
x='name',
y='count',
color='name',
color_discrete_sequence=px.colors.sequential.Blues_r,
opacity=0.8,
title='Subscriber Count vs Level'
)
fig.update_traces(marker_line_color='black',
marker_line_width=1)
fig.update_layout(showlegend=False, width=650)
fig.show()
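Total subscribers per level depend heavily on how many courses each level has. A small sketch that divides the subscriber totals by the course counts gives a per-course average. Both dictionaries copy values from the `levels` and `level_subscribers` tables above.

```python
# Course counts per level, copied from the levels table.
course_counts = {
    "Expert Level": 58,
    "Intermediate Level": 421,
    "Beginner Level": 1270,
    "All Levels": 1929,
}

# Total subscribers per level, copied from the level_subscribers table.
subscriber_totals = {
    "Expert Level": 50196,
    "Intermediate Level": 742005,
    "Beginner Level": 4051843,
    "All Levels": 6915076,
}

# Average number of subscribers per course for each level.
per_course = {
    level: subscriber_totals[level] / course_counts[level]
    for level in course_counts
}

for level, avg in sorted(per_course.items(), key=lambda kv: kv[1]):
    print(f"{level}: {avg:.0f} subscribers per course")
```

With the full DataFrame loaded, `data.groupby('level')['num_subscribers'].mean()` should give the same averages.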
data.loc[data.is_paid, 'price'].mean()
72.12885985748218
price_subject = data[data.is_paid].groupby('subject')['price'].mean().to_frame('mean_price').reset_index().rename(columns={'subject': 'subject_name'}).sort_values(by='mean_price')
price_subject
| | subject_name | mean_price |
|---|---|---|
| 2 | Musical Instruments | 53.154574 |
| 1 | Graphic Design | 61.390845 |
| 0 | Business Finance | 74.540491 |
| 3 | Web Development | 86.635426 |
fig = px.bar(price_subject,
x='subject_name',
y='mean_price',
color='mean_price',
color_continuous_scale=px.colors.sequential.Blues,
opacity=0.8,
title='Mean Price vs Subject Type'
)
fig.update_traces(marker_line_color='black',
marker_line_width=1)
fig.update_layout(width=700)
fig.show()
data['profit'] = data.price * data.num_subscribers  # list price x subscribers: a gross revenue estimate rather than true profit
data['year'] = data.published_timestamp.dt.year
price_year = data.groupby('year')['profit'].sum().to_frame('income').reset_index()
price_year
| | year | income |
|---|---|---|
| 0 | 2011 | 11643420 |
| 1 | 2012 | 11773470 |
| 2 | 2013 | 72652195 |
| 3 | 2014 | 106939045 |
| 4 | 2015 | 314510395 |
| 5 | 2016 | 276633190 |
| 6 | 2017 | 90769600 |
fig = px.bar(price_year,
x='year',
y='income',
color='income',
color_continuous_scale=px.colors.sequential.Blues,
opacity=0.8,
title='Annual Income'
)
fig.update_traces(marker_line_color='black',
marker_line_width=1)
fig.update_layout(width=700)
fig.show()
most_profitable = data.loc[:, ['course_title', 'profit', 'subject']].sort_values(by='profit', ascending=False)[:15]
most_profitable
| | course_title | profit | subject |
|---|---|---|---|
| 3230 | The Web Developer Bootcamp | 24316800 | Web Development |
| 3232 | The Complete Web Developer Course 2.0 | 22902400 | Web Development |
| 1979 | Pianoforall - Incredible New Way To Learn Pian... | 15099800 | Musical Instruments |
| 3204 | Angular 4 (formerly Angular 2) - The Complete ... | 14018770 | Web Development |
| 3247 | JavaScript: Understanding the Weird Parts | 13932100 | Web Development |
| 3251 | Learn and Understand NodeJS | 11350560 | Web Development |
| 2662 | The Complete HTML & CSS Course - From Novice T... | 11197290 | Web Development |
| 3175 | Complete PHP Course With Bootstrap3 CMS System... | 10789740 | Web Development |
| 3246 | Learn and Understand AngularJS | 10388175 | Web Development |
| 3254 | Modern React with Redux | 9146700 | Web Development |
| 3249 | Build Responsive Real World Websites with HTML... | 8575515 | Web Development |
| 2701 | Become a Web Developer from Scratch | 8302320 | Web Development |
| 3205 | Build Websites from Scratch with HTML & CSS | 7432265 | Web Development |
| 1213 | Photoshop for Entrepreneurs - Design 11 Practi... | 7257600 | Graphic Design |
| 2806 | The Complete Web Developer Masterclass: Beginn... | 7242495 | Web Development |
fig = px.bar(most_profitable.sort_values(by='profit'),
x='profit',
y='course_title',
orientation='h',
color='subject',
color_discrete_sequence=px.colors.sequential.Blues_r,
opacity=0.8,
title='Most Profitable Courses'
)
fig.update_traces(marker_line_color='black',
marker_line_width=1)
fig.update_layout(yaxis_categoryorder = 'total ascending')
fig.show()
data['clean_title'] = data.course_title.apply(nfx.remove_stopwords).apply(nfx.remove_special_characters)
titles_cleaned = data.clean_title.to_list()
title_words = []
for title in titles_cleaned:
    for word in title.split():
        title_words.append(word)
from collections import Counter
word_freq = Counter(title_words)
series = pd.Series(dict(word_freq.most_common(20)))
series.to_frame('count')
| | count |
|---|---|
| Learn | 491 |
| Trading | 280 |
| Beginners | 246 |
| Course | 231 |
| Guitar | 208 |
| Web | 205 |
| Design | 187 |
| Complete | 181 |
| Piano | 177 |
| Photoshop | 166 |
| Forex | 163 |
| Build | 161 |
| Financial | 138 |
| Create | 135 |
| JavaScript | 123 |
| Beginner | 120 |
| Guide | 116 |
| HTML | 116 |
| Accounting | 111 |
| Website | 110 |
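The word counts above rely on `neattext` for cleaning. As a rough sketch of the same idea using only the standard library, the example below tokenizes titles with a regular expression and filters a tiny stopword set. The `STOPWORDS` set and the `tokenize` helper are hypothetical stand-ins, not neattext's list or API. The three titles are taken from the tables in this notebook.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword set for illustration only.
STOPWORDS = {"the", "and", "to", "how", "your"}

titles = [
    "The Web Developer Bootcamp",
    "Learn and Understand NodeJS",
    "Learn and Understand AngularJS",
]


def tokenize(title):
    """Split a title into alphanumeric words and drop stopwords (case-insensitive)."""
    words = re.findall(r"[A-Za-z0-9]+", title)
    return [w for w in words if w.lower() not in STOPWORDS]


# Count every remaining word across all titles.
word_freq = Counter(word for title in titles for word in tokenize(title))

print(word_freq.most_common(5))
```

Counting is case-sensitive here, as in the table above, where `Beginners` and `Beginner` appear as separate entries.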
from wordcloud import WordCloud
wordcloud_data = ' '.join(title_words)
wordcloud = WordCloud(width = 800, height = 800,
min_font_size = 10).generate(wordcloud_data)
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.tight_layout(pad = 0)
plt.show()
most_famous = data.loc[:, ['course_title', 'num_subscribers']].sort_values(by='num_subscribers', ascending=False)[:15]
most_famous
| | course_title | num_subscribers |
|---|---|---|
| 2827 | Learn HTML5 Programming From Scratch | 268923 |
| 3032 | Coding for Entrepreneurs Basic | 161029 |
| 3230 | The Web Developer Bootcamp | 121584 |
| 2783 | Build Your First Website in 1 Week with HTML5 ... | 120291 |
| 3232 | The Complete Web Developer Course 2.0 | 114512 |
| 1896 | Free Beginner Electric Guitar Lessons | 101154 |
| 2589 | Web Design for Web Developers: Build Beautiful... | 98867 |
| 2619 | Learn Javascript & JQuery From Scratch | 84897 |
| 3289 | Practical PHP: Master the Basics and Code Dyna... | 83737 |
| 3247 | JavaScript: Understanding the Weird Parts | 79612 |
| 1979 | Pianoforall - Incredible New Way To Learn Pian... | 75499 |
| 3204 | Angular 4 (formerly Angular 2) - The Complete ... | 73783 |
| 3665 | Beginner Photoshop to HTML5 and CSS3 | 73110 |
| 2782 | Web Development By Doing: HTML / CSS From Scratch | 72932 |
| 3325 | HTML and CSS for Beginners - Build a Website &... | 70773 |
fig = px.bar(most_famous.sort_values(by='num_subscribers'),
x='num_subscribers',
y='course_title',
orientation='h',
color='num_subscribers',
color_continuous_scale=px.colors.sequential.Blues,
opacity=0.8,
title='Most Popular Courses'
)
fig.update_traces(marker_line_color='black',
marker_line_width=1)
fig.show()